Last week, we focused on calibration, allowing us to fine-tune our model parameters to closely match empirical data. With a calibrated model in hand, we can now run experiments in silico: simulating experimental conditions in a virtual environment. This step enables us to explore key design questions, refine our experimental setup, and improve the planning of real-world studies.
Consider the question of how people’s opinions are influenced by others. More specifically, we are interested in how different network topologies affect polarization within a population.1 We assume that the influence of a neutral source differs from the influence of someone with shared partisanship, and also from someone with opposing partisanship. This sets the stage for an in silico experiment. Figure 1 shows four different experimental treatments that participants might encounter when learning about the opinions of others:
Each treatment reflects a different network topology that a certain agent (ego, the red circle) is part of. The experiment proceeds as follows: first, participants’ initial opinions are recorded individually. Then, they are informed about the opinions of their neighbors, after which their opinions are measured again. In the first (neutral) condition, ego learns only the opinions of four others, without knowing their partisanship. In the other three conditions, ego also learns their neighbors’ party preferences: either all similar (partisan), all opposing (antipartisan), or a balanced mix (mixed).
We assume that neutral influence (\(\mu\)) is weaker than influence from like-minded individuals (\(\mu_P\)), while influence from those with opposing partisanship (\(\mu_A\)) is weaker still, or even negative. In short, we expect \(\mu_A < \mu < \mu_P\). Drawing on empirical findings from previous studies, we use calibrated values for these parameters (calibrated_model) to run our experiment in silico:
simulation <- list(
  model = "calibrated_model",
  replications = 2,
  tbl_parameter = expand.grid(
    network = c("neutral", "partisan", "antipartisan", "mixed")
  ),
  master_seed = 42
)
results <- run_simulation(simulation)
To analyze the simulation output, we compute polarization within each simulated world. For this, we calculate the average absolute opinion difference between the two partisan groups. The use of absolute differences reflects our focus on the magnitude of disagreement, not which group supports which view.
Show R Code (calc_polarization function)
library(glue)
library(dplyr)
library(purrr)
library(tidyr)

calc_polarization <- function(
  results,
  table = "final"
) {
  if (table == "final") {
    df <- results$df_final_worlds
  } else if (table == "init") {
    df <- results$df_init_worlds
  }
  # Identify the columns defining each "world"
  start <- which(names(df) == "replications") + 1
  world_cols <- names(df)[start:length(names(df))]
  n_opinions <- length(df$opinions[[1]])
  # Extract each opinion position out of the list-column
  for (i in seq_len(n_opinions)) {
    df[[paste0("opinion_", i)]] <- map_dbl(df$opinions, ~ .x[i])
  }
  # Compute the absolute difference in average opinions by group
  result_list <- vector("list", length = n_opinions)
  for (i in seq_len(n_opinions)) {
    opinion_col <- paste0("opinion_", i)
    abs_diff_col <- glue("abs_diff_opinion_{i}")
    result_list[[i]] <- df |>
      group_by(rep, across(all_of(world_cols)), partisanship) |>
      summarize(avg_opinion = mean(.data[[opinion_col]]), .groups = "drop") |>
      pivot_wider(names_from = partisanship,
                  values_from = avg_opinion,
                  names_prefix = "opinion_") |>
      mutate(!!abs_diff_col := abs(opinion_0 - opinion_1)) |>
      dplyr::select(-opinion_0, -opinion_1)
  }
  # Combine all opinion positions into one dataframe
  final_results <- bind_rows(result_list)
  final_results
}

Now we can use this to see how the different network topologies affect polarization:
df_polarization <- calc_polarization(results)

# show average differences between participants in the worlds
diffs <- df_polarization |>
  group_by(network) |>
  summarize(
    abs_diff_mean = mean(abs_diff_opinion_1)
  )
diffs

# A tibble: 4 × 2
network abs_diff_mean
<fct> <dbl>
1 neutral 0.0870
2 partisan 0.124
3 antipartisan 0.168
4 mixed 0.149
The output reveals that average polarization is lowest in the neutral condition and highest in the antipartisan condition. A single round of interaction is enough to increase polarization by more than 8 percentage points when moving from the neutral to the antipartisan condition.
We can also assess the effect size of these differences (using the library effsize):
library(effsize)

d <- cohen.d(
  formula = abs_diff_opinion_1 ~ network,
  data = df_polarization |>
    filter(network %in% c("neutral", "antipartisan")) |>
    droplevels(),
  pooled = TRUE,
  hedges.correction = FALSE
)
d
Cohen's d
d estimate: -0.8140591 (large)
95 percent confidence interval:
lower upper
-5.291375 3.663257
The result suggests a large effect, although the confidence interval is extremely wide. But is that enough to justify a real-world experiment based on these findings? While the simulated effects are promising, we need to know whether such differences would be statistically significant with real experimental data. A common approach is to run the same statistical test we would later run in the real experiment to check whether polarization differs between treatments. Let’s assume we are content with a simple t-test and use it to compare the neutral and the antipartisan condition:
ttest <- df_polarization |>
  filter(network %in% c("neutral", "antipartisan")) |>
  t.test(abs_diff_opinion_1 ~ network, data = _)
ttest
Welch Two Sample t-test
data: abs_diff_opinion_1 by network
t = -0.81406, df = 1.4419, p-value = 0.5281
alternative hypothesis: true difference in means between group neutral and group antipartisan is not equal to 0
95 percent confidence interval:
-0.7159995 0.5536910
sample estimates:
mean in group neutral mean in group antipartisan
0.0869696 0.1681238
Yet, the difference between the two treatments is not statistically significant (\(p = 0.528\)), even though the effect size is considerable. Why? Because we only have two worlds per condition. Remember: polarization is a property of the entire simulated world, not of individuals. So even if each world contains many agents, our actual sample size for the statistical test is just two worlds per treatment.
Clearly, this sample size is far too small to detect substantial effects with confidence. Had we proceeded with a real-world experiment based on such limited data, we might have wasted significant resources. To prevent this, we turn to power analysis. Power analysis allows us to estimate how many simulated worlds (or real-world participants) are needed to detect effects of a given size with a high probability. This ensures that the experiments we design are not only theoretically sound but also practically effective.
At the same time, as you may already know from earlier statistics courses, increasing the sample size sufficiently will always lead to statistical significance, regardless of effect size. Therefore, we aim for a sample size that is just sufficient to detect the effect sizes we believe to be realistic—while remaining insensitive to much smaller, potentially spurious effects. In other words, power analysis helps us find the minimum number of worlds (or more generally sample size) required to robustly detect meaningful differences, without overcommitting resources or overfitting to noise.
Learning Goals
- Define and interpret statistical power.
- Identify the main components affecting experimental power.
- Calculate power using both analytic formulas and simulation methods.
- Conduct sensitivity analyses to assess how robust your power calculations are to assumptions.
Statistical Power
Statistical power (henceforth simply power) is the probability that an experiment correctly identifies a true effect—it’s our ability to detect a meaningful signal amidst the noise. In any experimental setting, the signal we seek is the actual impact of a treatment. This could be:
- Does educational intervention increase income levels?
- Do vaccination campaigns reduce disease incidence?
- Which network topologies affect partisan polarization the most?
The noise refers to the natural variability or randomness in data, caused by many unpredictable factors that are difficult—or even impossible—to measure or control for statistically. It’s typically quantified as the standard deviation of outcomes. Consider the following examples:
- If your outcome is the prevalence of a rare disease, daily fluctuations are usually minimal—thus, noise is low. Even a modest decrease (e.g., 1 percentage point) in disease rates could be detectable.
- Conversely, measuring income as an outcome is inherently noisier, as individual incomes can differ greatly. A modest income increase of 1% might easily go unnoticed due to substantial background noise.
Adequate power ensures that a true effect—if it exists—won’t be masked by randomness.
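To make the signal-versus-noise intuition concrete, here is a small illustrative sketch (all numbers are hypothetical, not taken from our model): the same one-unit treatment effect that is obvious under low noise disappears under high noise.

```r
# Hypothetical illustration: identical effect size, different noise levels.
set.seed(1)
n <- 50
effect <- 1  # true treatment effect (arbitrary units)

# low noise: sd = 1, so the effect is large relative to the variability
p_low  <- t.test(rnorm(n, 0, 1),  rnorm(n, effect, 1))$p.value
# high noise: sd = 10, the same effect drowns in background variability
p_high <- t.test(rnorm(n, 0, 10), rnorm(n, effect, 10))$p.value

p_low   # typically highly significant
p_high  # typically far from significance
```

The treatment effect is identical in both calls; only the standard deviation of the outcomes changes.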
Power and Specificity
When planning an experiment, two core statistical concepts must be carefully balanced: power and specificity (see Table 1). Understanding how they interact is crucial for designing experiments that are both scientifically rigorous and resource-efficient.
Specificity refers to the probability of correctly retaining the null hypothesis when it is actually true. In other words, it is the ability of a test to avoid false positives. The significance level \(\alpha\) defines how willing we are to risk a Type I error—rejecting the null hypothesis \(H_0\) when it is actually true. This is often set at 0.05, meaning we accept a 5% chance of a false positive. Accordingly, the specificity of such a test is \(1 - \alpha = 0.95\).
On the other hand, statistical power—denoted as \(1 - \beta\)—is the probability of correctly detecting a true effect. It represents the ability to avoid a Type II error (failing to reject \(H_0\) when it is false). The value \(\beta\) quantifies the probability of making such an error. Power is conventionally set at 80%, meaning we aim for \(1 - \beta = 0.8\), or in other words, we want to correctly detect true effects in 4 out of 5 experiments.
Balancing power and specificity is essential: increasing one often comes at the cost of the other. Power analysis helps us navigate this trade-off, ensuring that we choose a sample size large enough to detect meaningful effects, without inflating the risk of false positives.
| Test decision… | \(H_0\) is true | \(H_1\) is true |
|---|---|---|
| … in favor of \(H_0\) | ✅ Correct decision (Specificity), probability: \(1 - \alpha\) | ❌ Type II error, probability: \(\beta\) |
| … in favor of \(H_1\) | ❌ Type I error, probability: \(\alpha\) | ✅ Correct decision (Power), probability: \(1 - \beta\) |
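These error rates can be recovered by simulation: under a true null the rejection rate sits near \(\alpha\), and tightening \(\alpha\) visibly lowers power under a true effect. A minimal sketch with illustrative numbers (not our model):

```r
set.seed(42)
# p-values from many simulated two-sample t-tests
sim_p <- function(delta, n = 30, reps = 2000) {
  replicate(reps, t.test(rnorm(n), rnorm(n, delta))$p.value)
}
p_null <- sim_p(delta = 0)    # H0 true: no effect
p_alt  <- sim_p(delta = 0.5)  # H1 true: a moderate effect

fpr      <- mean(p_null < 0.05)  # false-positive rate, close to alpha
power_05 <- mean(p_alt < 0.05)   # power at alpha = 0.05
power_01 <- mean(p_alt < 0.01)   # power at a stricter alpha = 0.01
c(fpr = fpr, power_05 = power_05, power_01 = power_01)
```

Lowering \(\alpha\) from 0.05 to 0.01 buys specificity at the price of power, exactly the balance described above.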
Four Core Elements of Statistical Power
Four primary factors shape an experiment’s power:
The Test Itself and the Statistical Significance Criterion
The type of statistical test used—and the threshold set for statistical significance—directly influences power. Some tests are inherently more sensitive than others, depending on the data and research design. In addition, the choice of \(\alpha\) (commonly set at 0.05) defines the strictness of evidence required to reject the null hypothesis. A lower \(\alpha\) (e.g., 0.01) reduces the chance of false positives but also makes it harder to detect true effects, thereby reducing power. Conversely, increasing \(\alpha\) boosts power but raises the risk of false positives.
Strength of Treatment
Stronger treatments are easier to detect and therefore increase statistical power. For example, giving each participant a large sum of money is more likely to produce a measurable effect than giving only a small sum. However, in practice, researchers often face constraints—whether due to limited resources or limited control over the intervention—especially when evaluating existing programs or policies.
Background Noise (Variability)
The more variability there is in outcomes, the harder it is to detect a true effect. Selecting outcome measures with lower variability can help improve power. In many cases, however, variability is inherent and difficult to reduce—especially when measuring complex behaviors like income or attitudes. This is precisely one of the reasons researchers turn to lab experiments: to create controlled environments that minimize noise. By standardizing conditions across participants, lab settings help keep variability low, making it easier to isolate and detect the effect of interest.
Experimental Design
This is (often) the most direct lever researchers have to influence power. Power analysis often focuses on sample size—larger samples reduce uncertainty and increase power. But good design goes beyond just size. Consider:
- How units are randomized (individually or in clusters),
- Whether key covariates are measured and controlled for,
- How many treatment groups are included and how they’re structured.
Time To Play (and Learn!)
In the interactive graph below, you can explore how the key parameters of hypothesis testing—significance criterion \(\alpha\), difference in mean \(\mu_1\), and standard deviation \(\sigma\)—interact to shape the probabilities of different outcomes.2 By adjusting the sliders, observe how the rates of Type I errors (false positives) and Type II errors (false negatives) change.
- The blue shaded areas represent Type I errors: regions where we incorrectly reject the null hypothesis (\(H_0\) is true, but we reject it).
- The red shaded area represents a Type II error: failing to reject the null hypothesis when the alternative is actually true.
- The remaining areas under the curves represent correct decisions: either correctly retaining the null or correctly detecting a true effect.
In an ideal scenario, we aim to minimize both the blue and red areas—reducing the likelihood of false discoveries and missed effects. Try adjusting the parameters to see how you can shrink both error regions simultaneously!
(Interactive graph: sliders for the significance criterion \(\alpha\), the difference in mean \(\mu_1\), and the standard deviation \(\sigma\).)
As you will notice, all three parameters—significance criterion \(\alpha\), difference in mean \(\mu_1\), and standard deviation \(\sigma\)—impact the probability of a Type II error (\(\beta\)). However, they are not equally advisable to manipulate from a scientific perspective.
While increasing the significance criterion \(\alpha\) can improve power by reducing \(\beta\), doing so comes at the cost of raising the false positive rate. This compromises the integrity of your results and increases the risk of contributing to irreproducible findings. Good scientific practice discourages simply relaxing \(\alpha\) to gain statistical significance.
Instead, maintaining a conventional threshold (e.g., \(\alpha = 0.05\)) ensures that evidence required to reject the null hypothesis remains stringent. This protects the credibility of statistical inference and helps prevent the overstatement of weak or spurious effects—one of the central concerns in the current replication crisis across many disciplines.
Therefore, we should seek to influence \(\beta\) by adjusting parameters that improve our experimental design without compromising statistical integrity—namely, increasing effect size (e.g., through better treatment design), reducing variability (e.g., by improving measurement precision), or increasing sample size (e.g., adding more participants or simulation runs).
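Base R’s power.t.test() makes these three levers easy to compare. The numbers below are illustrative, not taken from our experiment:

```r
# Baseline design: n = 25 per group, effect delta = 0.5, noise sd = 1
baseline   <- power.t.test(n = 25, delta = 0.5, sd = 1.0)$power
more_n     <- power.t.test(n = 50, delta = 0.5, sd = 1.0)$power  # larger sample
bigger_d   <- power.t.test(n = 25, delta = 0.8, sd = 1.0)$power  # stronger treatment
less_noise <- power.t.test(n = 25, delta = 0.5, sd = 0.7)$power  # better measurement
c(baseline = baseline, more_n = more_n, bigger_d = bigger_d, less_noise = less_noise)
```

Each lever raises power above the baseline without touching \(\alpha\).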
Conducting a Power Analysis
In the previous section, we explored several key design choices—such as sample size, randomization strategy, and control of covariates—that can greatly influence an experiment’s statistical power. To make informed design decisions, however, we must first understand how to quantify power. This is where power analysis comes in.
There are two primary approaches to conducting a power analysis:
- Analytical methods, which use mathematical formulas derived from statistical theory to estimate power.
- Simulation-based methods, which involve generating synthetic datasets under specified assumptions and repeatedly testing them to observe how often a statistically significant result occurs.
Each approach has its advantages. Analytical formulas are fast, interpretable, and useful for standard designs—but they depend on idealized assumptions (e.g., normality, equal variances). Simulation, by contrast, is computationally more intensive but highly flexible, making it better suited for complex or customized scenarios.
In the sections that follow, we will explore both methods in detail and discuss how to apply them to real experimental planning.
The Analytical Way
There are many formulas for calculating statistical power, each tailored to a specific type of test (e.g., t-tests, ANOVA, regression). To keep things simple, we’ll focus on the classical two-sample, two-sided t-test, which compares the means of two independent groups.
For this case, the power (\(1 - \beta\)) can be approximated using the following formula:
\[ 1-\beta \approx 1 - \Phi\left( \Phi^{-1} \left(1-\frac{\alpha}{2}\right) - \frac{|\mu_t-\mu_c|\sqrt{N}}{2\sigma} \right) \]
Here’s what these symbols mean:
- \(\beta\): Probability of making a Type II error.
- \(\Phi\): Cumulative distribution function (CDF) for the normal distribution.
- \(\Phi^{-1}\): The inverse normal distribution function.
- \(\mu_t\): Mean outcome of the treatment group.
- \(\mu_c\): Mean outcome of the control group.
- \(\sigma\): Standard deviation (noise) of outcomes, assumed equal across groups.
- \(\alpha\): Significance level (commonly set at 0.05).
- \(N\): Total sample size.
Note that this approximation becomes more accurate as \(N\) increases, because under the alternative hypothesis (\(H_1\)), the Student’s t-distribution converges to the standard normal distribution.
Let’s apply this formula to estimate the power of our experiment with \(N = 4\) (2 simulated worlds per condition):
power_calculator <- function(mu_t, mu_c, sigma, alpha = 0.05, N, two.sided = TRUE) {
  if (!two.sided) alpha <- 2 * alpha
  inside <- qnorm(1 - alpha / 2) - abs(mu_t - mu_c) * sqrt(N) / (2 * sigma)
  1 - pnorm(inside)
}

# calculate the sd across our two experimental conditions
sigma <- df_polarization |>
  filter(network %in% c("neutral", "antipartisan")) |>
  pull(abs_diff_opinion_1) |>
  sd()

# diffs[3,2] holds the antipartisan (treatment) mean, diffs[1,2] the neutral (control) mean
power_calculator(
  mu_t = diffs[3,2] |> pull(),
  mu_c = diffs[1,2] |> pull(),
  sigma = sigma,
  alpha = 0.05,
  N = 4
)

[1] 0.1365657
We see that with \(N = 4\), we can only expect about 13.7% power—meaning that in roughly 86% of similar experiments, we would fail to detect a real effect, despite a substantial difference between groups.
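One caveat: at such a tiny \(N\), the normal approximation is optimistic, because the t-distribution with so few degrees of freedom has much heavier tails than the normal. A sketch with stand-in numbers (not the values from our simulation) shows the gap against base R’s exact power.t.test():

```r
# Normal-approximation power (the same formula as power_calculator above)
approx_power <- function(delta, sigma, N, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha / 2) - abs(delta) * sqrt(N) / (2 * sigma))
}

# Stand-in numbers: a very large effect (delta = 2, sigma = 1), N = 4 total
approx <- approx_power(delta = 2, sigma = 1, N = 4)     # about 0.52
exact  <- power.t.test(n = 2, delta = 2, sd = 1)$power  # noticeably lower
c(approx = approx, exact = exact)
```

The exact t-based power is well below the approximation, so small-\(N\) results from the formula should be read as upper bounds.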
How many observations do we need? For a quick approximation, we can use Lehr’s rule of thumb (1992), which estimates sample size for an 80% powered, two-sided t-test at \(\alpha = 0.05\) as:
\[ n \approx 16 \cdot \frac{\sigma^2}{(\mu_c-\mu_t)^2} \] In this formula, \(n\) represents the required sample size per group. Since we are comparing two groups, the total sample size is \(N = 2 \cdot n\). Let’s apply this rule in R:
approx_n <- function(
  mu_t,
  mu_c,
  sigma
) {
  16 * (sigma^2 / (mu_t - mu_c)^2)
}

approx_n(
  mu_t = 0.1681238,
  mu_c = 0.0869696,
  sigma = sigma
)

[1] 21.4293
This yields an estimated \(n \geq 22\) per group to achieve 80% power. Since we are comparing two groups, we must therefore sample at least \(N = 44\) worlds in total.
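As a cross-check with illustrative textbook numbers (a medium effect, \(\delta = 0.5\), \(\sigma = 1\), not our simulated values): base R’s power.t.test() can solve for the per-group n directly, and it lands very close to Lehr’s rule.

```r
# Lehr's rule of thumb: n per group for 80% power, two-sided alpha = 0.05
lehr_n <- function(mu_t, mu_c, sigma) 16 * sigma^2 / (mu_t - mu_c)^2

lehr_n(mu_t = 0.5, mu_c = 0, sigma = 1)           # rule of thumb: 64
power.t.test(delta = 0.5, sd = 1, power = 0.8)$n  # exact solve: about 63.8
```

For moderate effect sizes the rule of thumb is accurate to within a fraction of an observation; it drifts further from the exact answer when the required n is small.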
While this is helpful for a rough estimate, it’s not advisable to rely solely on such approximations. To improve accuracy, we can use the WebPower package, which offers more precise calculations and supports a range of designs, including unequal group sizes, paired samples, and more.
Let’s use WebPower to plot how power increases with \(N\), using the previously calculated effect size as an anchor.
# install.packages("WebPower")
library(WebPower)

# we calculated d earlier with cohen.d
res <- wp.t(n1 = seq(20, 30, by = 1), d = d$estimate, type = "two.sample", alternative = "two.sided")
plot(res)
abline(h = 0.8, lty = 2, col = "red")

We observe that while Lehr’s rule suggests \(n = 22\), a more precise estimate shows we would actually need at least \(n = 25\) to achieve 80% power in our experiment.
Trusting Your Power Analysis
Power analyses inherently rely on uncertain assumptions. Critics might point out this circular reasoning: you use guesses about effect size and variability to determine how likely you are to detect effects.
However, power analysis is a powerful tool precisely because it reveals sensitivity to these assumptions. You can easily explore how your conclusions change with variations in assumed effect size, sample size, and noise.
For instance, vary the sample size (\(N\)) to observe how power changes:
- Larger samples increase power.
- Smaller effect sizes or higher variability reduce power.
Conducting sensitivity analyses this way helps you make informed, transparent decisions about experimental design.
It’s also worth noting that our current estimates rely on just a few simulations (e.g., \(N = 4\) total worlds for comparing neutral vs. antipartisan conditions). This low number limits the reliability of our estimates. To obtain more trustworthy results, we should next turn to simulation-based power estimation—directly measuring how often our experiment would succeed, given our current assumptions.
Simulation: A Practical Alternative to Formula-Based Power Analysis
Traditional formulas offer quick insights, but modern computational methods—such as simulations—can provide richer, more flexible analysis.
Simulation involves:
- Repeatedly generating hypothetical datasets based on your assumptions.
- Conducting statistical tests on these simulated datasets.
- Calculating the proportion of simulated experiments detecting significant effects.
This approach makes fewer assumptions and lets you explore complex scenarios realistically, especially useful for advanced or novel experimental designs.
Simulations thus enhance your understanding of an experiment’s robustness and strengthen the validity of your experimental findings.
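The three steps above can be sketched generically in a few lines of R (illustrative parameters, not our model):

```r
set.seed(7)
# 1) generate data under assumptions, 2) run the test, 3) count significant results
sim_power <- function(n, delta, sd = 1, alpha = 0.05, reps = 1000) {
  p_values <- replicate(
    reps,
    t.test(rnorm(n, mean = 0, sd = sd), rnorm(n, mean = delta, sd = sd))$p.value
  )
  mean(p_values < alpha)  # share of simulated experiments that reject H0
}

sim_power(n = 64, delta = 0.5)  # close to the analytic 80% for this design
```

Any part of this recipe—the data-generating process, the test, the decision rule—can be swapped out to match the actual experiment, which is exactly what we do below with our agent-based model.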
Implementation in R
Under normal circumstances—when we’re interested in individual-level outcomes—our run_simulation() function would suffice for estimating the required \(N\). We could vary \(N\) directly in tbl_parameter and track when the treatment becomes significant in more than 80% of the replications.
However, in our case, we’re interested in a property of the world itself (i.e., polarization), not in individual behavior. In this context, the replications used in run_simulation() actually correspond to the number of independent worlds (or experimental sessions) we simulate per treatment.
A clean and robust solution would be to write a run_meta_simulation() function that wraps around run_simulation() and handles repeated simulation for each sample size, collecting p-values across runs. However, such an implementation is complex. Instead, we’ll take a shortcut and slightly modify how we use our existing run_simulation():
replications <- 100
worlds_per_treatment_max <- 40

simulation <- list(
  model = "calibrated_model",
  replications = worlds_per_treatment_max * replications,
  tbl_parameter = expand.grid(
    network = c("neutral", "antipartisan")
  ),
  master_seed = 42
)

Here we define 100 replications and 40 as the maximum number of worlds per treatment we want to simulate.
Now we run this simulation:
meta_results <- run_simulation(simulation)

The idea is as follows: replications 1 to 100 correspond to the first world, 101 to 200 to the second world, and so on. If we want to compare, for example, 10 neutral worlds with 5 antipartisan worlds, we can take replications 1, 101, 201, …, 901 and compare those to the corresponding replications for the other condition.
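The replication-to-world bookkeeping is easy to get wrong, so here is the index arithmetic in isolation (a sketch using the numbers above):

```r
replications <- 100  # simulated replications per world slot
i <- 1               # base replication for this power-estimation run
n_worlds <- 10       # how many worlds we want for one condition

# world w uses base replication i offset by (w - 1) * replications
idx <- seq(from = i, by = replications, length.out = n_worlds)
idx  # 1 101 201 ... 901
```

Varying i from 1 to 100 yields 100 disjoint sets of worlds, which is what lets us estimate power as a proportion over 100 independent "experiments".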
To automate this, we write a function that calculates power by looping over replications, collecting p-values, and computing the share of tests that fall below our \(\alpha\) threshold:
get_power_from_simulation <- function(
  df,
  n_1,
  n_2,
  by_var,   # unquoted column name, e.g. network
  by_val1,  # value of by_var for group 1, e.g. "neutral"
  by_val2,  # value of by_var for group 2, e.g. "antipartisan"
  replications
) {
  by_var <- enquo(by_var)
  # Pre-filter the two condition sets once:
  df_by <- df |> filter((!!by_var) == by_val1 | (!!by_var) == by_val2)
  rejections <- vapply(seq_len(replications), FUN.VALUE = logical(1), function(i) {
    idx1 <- seq(from = i, by = replications, length.out = n_1)
    idx2 <- seq(from = i, by = replications, length.out = n_2)
    # keep group-1 rows whose rep is in idx1 and group-2 rows whose rep is in idx2
    df_sel <- df_by |>
      filter(
        ((!!by_var) == by_val2 | rep %in% idx1) &
        ((!!by_var) == by_val1 | rep %in% idx2)
      )
    run_test(df_sel)
  })
  # proportion of rejections = estimated power
  mean(rejections)
}

run_test <- function(df) {
  t.test(abs_diff_opinion_1 ~ network, data = df, alternative = "two.sided")$p.value < 0.05
}

Note that we defined a small helper function, run_test(), which performs a \(t\)-test. By isolating the test logic in its own function, we make it easy to later substitute alternative statistical tests—simply by modifying this subfunction—without needing to change the surrounding analysis pipeline.
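For example, a nonparametric variant can be dropped in without touching the rest of the pipeline (a sketch; base R’s wilcox.test() replaces the t-test):

```r
# Alternative helper: Wilcoxon rank-sum test instead of a t-test.
# Same interface as run_test(): takes a data frame, returns TRUE on rejection.
run_test_wilcox <- function(df) {
  wilcox.test(abs_diff_opinion_1 ~ network, data = df)$p.value < 0.05
}
```

Reassigning run_test <- run_test_wilcox would rerun the entire power analysis under the new test, since get_power_from_simulation() looks the helper up by name.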
We can now apply this function to estimate power for 10 neutral and 5 antipartisan worlds:
df_polarization <- calc_polarization(meta_results)
get_power_from_simulation(
  df = df_polarization,
  n_1 = 10,
  n_2 = 5,
  by_var = network,
  by_val1 = "neutral",
  by_val2 = "antipartisan",
  replications = 100
)

[1] 0.23
Not surprisingly, power is still low—around 23%. But now, let’s systematically explore how power increases as we raise the number of worlds per treatment:
Show R Code (Power plot)
sample_size <- seq(10, 30, by = 1)

df <- tibble(sample_size = sample_size) |>
  mutate(
    power = map_dbl(sample_size, ~
      get_power_from_simulation(
        df = df_polarization,
        n_1 = .x,
        n_2 = .x,
        by_var = network,
        by_val1 = "neutral",
        by_val2 = "antipartisan",
        replications = 100
      )
    )
  )
library(plotly)

plot_ly(df,
  x = ~sample_size,
  y = ~power,
  type = 'scatter',
  name = 'Power by n',
  mode = 'lines+markers',
  hoverinfo = 'text',
  text = ~paste0('N = ', sample_size, '<br>Power = ', round(power, 2))
) %>%
  add_lines(x = ~sample_size,
            y = rep(0.8, nrow(df)),
            line = list(dash = 'dash', color = 'red'),
            name = '80% Power') %>%
  layout(
    title = 'Power vs. Sample Size',
    xaxis = list(title = 'Sample Size (N)'),
    yaxis = list(title = 'Power'),
    showlegend = TRUE
  )

We observe that with a symmetric sample size of 19 worlds per condition, i.e. \(N = 38\) in total, we already achieve over 80% power. This required sample size is somewhat lower than our earlier analytical estimate, likely because that estimate was based on only 4 observations to calculate \(\sigma\), introducing higher variability.
We can inspect the effect size again for the complete run (all 8,000 worlds):
cohen.d(
  formula = abs_diff_opinion_1 ~ network,
  data = df_polarization |>
    filter(network %in% c("neutral", "antipartisan")) |>
    droplevels(),
  pooled = TRUE,
  hedges.correction = FALSE
)
Cohen's d
d estimate: -0.7979546 (medium)
95 percent confidence interval:
lower upper
-0.8434983 -0.7524109
We see that this is not far off what we initially estimated.
This simulation-based approach gives us a robust, assumption-light estimate of the power under realistic experimental conditions. It’s particularly valuable when classical formulas are too simplistic for the design at hand.
Why Use Unequal Sample Sizes?
Throughout this chapter, we have primarily assumed equal sample sizes for treatment and control groups (i.e., \(n_0 = n_1\)). This design is indeed statistically optimal: comparing 10 vs. 10 yields more power than, for example, 19 vs. 1. In general, power is maximized when group sizes are balanced, assuming all else is equal.
However, there are several situations in which using unequal sample sizes is not only acceptable, but actually advisable.
1. Cost Constraints
In some experiments, one treatment group may be significantly more expensive to implement than the other. For example, if administering the treatment requires costly resources—say, specialist time, hardware, or financial incentives—while the control group involves only basic measurement, then a balanced design may be impractical.
Instead of running a 10 vs. 10 trial, you might run 15 vs. 8. Although this introduces some statistical inefficiency, it may still be preferable if it leads to lower overall cost. The trade-off here is between statistical power and budget efficiency: slightly reduced power may be acceptable if it enables more feasible execution.
2. Logistical or External Constraints
Sometimes, sample sizes in one group are fixed by circumstances beyond the researcher’s control. For example:
- You may only have access to a limited number of treated units (e.g., policy participants, rare populations, or invited experts).
- The treatment group could be capped due to institutional or ethical limits.
- Recruitment channels might produce uneven inflows into treatment and control.
In such cases, the sample size for one group is fixed, and the researcher must calculate the required sample size for the other group to achieve the desired power.
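Under a normal approximation this calculation is straightforward to sketch (illustrative numbers throughout: a fixed treated group of \(n_1 = 15\), a large stand-in effect \(\delta = 0.8\), \(\sigma = 1\)). The key identity is that a design with groups \(n_1, n_2\) has the same power as a balanced design with effective total \(N_{\text{eff}} = 4 n_1 n_2 / (n_1 + n_2)\):

```r
# Power of a two-sided two-sample z-test with unequal group sizes,
# via the equal-n equivalent total sample size N_eff.
power_unequal <- function(n1, n2, delta, sigma, alpha = 0.05) {
  n_eff <- 4 * n1 * n2 / (n1 + n2)  # harmonic-mean rule; equals 2n when n1 = n2 = n
  1 - pnorm(qnorm(1 - alpha / 2) - abs(delta) * sqrt(n_eff) / (2 * sigma))
}

n1 <- 15  # fixed by circumstances
n2 <- 2
while (power_unequal(n1, n2, delta = 0.8, sigma = 1) < 0.8) n2 <- n2 + 1
n2  # control-group size needed for 80% power
```

Note how expensive the imbalance is under these stand-in numbers: a balanced design reaches 80% power with about 25 units per group (50 in total), while fixing one group at 15 forces the other up to 68.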
3. Adaptive Designs and Reweighting
Unequal sample sizes may also emerge as a result of adaptive experimental designs, where assignment probabilities shift over time (e.g., in multi-armed bandits). Or, they may reflect intentional oversampling of one group to ensure adequate representation or variance for subgroup analysis.
Statistical techniques such as weighted regression or generalized estimating equations can help correct for imbalances, especially in observational or quasi-experimental contexts.
In summary, while equal sample sizes provide maximum statistical efficiency, real-world considerations—such as cost, availability, and ethics—can justify unequal designs. In such cases, power analysis becomes especially important to determine how best to allocate limited resources while still preserving the ability to detect meaningful effects.
Increasing Power Further
Until now, we have primarily focused on increasing power by adjusting sample size—arguably the most straightforward and commonly used approach. However, as previously discussed, sample size is not the only lever available to researchers. Another highly effective strategy is to use the most efficient test statistic for the research question and data structure at hand.
But what exactly does that mean? To understand this, let’s reconsider the structure of our experiment:
\[ R \longrightarrow T_0 \longrightarrow Y_0 \newline R \longrightarrow T_1 \longrightarrow Y_1 \] In this setup, participants are randomized (\(R\)) into either a control group (\(T_0\), neutral) or a treatment group (\(T_1\), antipartisan), and we then measure the outcome—average partisan difference in opinions—after the treatment.
However, our actual experimental design contains more information:
\[ R \longrightarrow Y_0^{t_0} \longrightarrow T_0 \longrightarrow Y_0^{t_1} \newline R \longrightarrow Y_1^{t_0} \longrightarrow T_1 \longrightarrow Y_1^{t_1} \] That is, we also collect an initial measurement of opinions before treatment assignment, allowing us to observe polarization at baseline (\(Y^{t_0}\)), in addition to the post-treatment outcome (\(Y^{t_1}\)).
This pre-treatment measurement allows for a more precise analysis. Instead of relying solely on a simple \(t\)-test comparing post-treatment outcomes, we can now use a regression model that adjusts for baseline polarization. This approach not only increases statistical power but also helps account for initial differences between groups.
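The mechanism behind the gain is variance reduction: if baseline and outcome correlate with \(\rho\), regression adjustment shrinks the residual noise by a factor of \(\sqrt{1 - \rho^2}\). A small self-contained simulation (illustrative numbers, not our model) makes the power difference visible:

```r
set.seed(3)
n <- 20; delta <- 0.5; rho <- 0.9  # strong baseline-outcome correlation

one_run <- function() {
  baseline <- rnorm(2 * n)
  group    <- rep(c(0, 1), each = n)
  # outcome has unit variance: rho^2 from baseline, 1 - rho^2 fresh noise
  outcome  <- delta * group + rho * baseline + sqrt(1 - rho^2) * rnorm(2 * n)
  c(
    t_test = t.test(outcome ~ group)$p.value,
    ancova = summary(lm(outcome ~ group + baseline))$coefficients["group", 4]
  )
}

p_values <- replicate(1000, one_run())
rowMeans(p_values < 0.05)  # power: the adjusted model clearly beats the t-test
```

With \(\rho = 0.9\), the adjusted analysis effectively works with less than half the noise, which is why our regression-based run_test() below needs far fewer worlds.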
Fortunately, our calc_polarization() function supports this by allowing us to extract both final and initial polarization measures:
df_polarization <- calc_polarization(meta_results) |>   # get final polarization
  bind_cols(
    calc_polarization(meta_results, table = "init") |>  # get initial polarization
      rename(abs_diff_opinion_1_init = abs_diff_opinion_1) |>
      dplyr::select(abs_diff_opinion_1_init)
  )

Next, we redefine our run_test() function to include abs_diff_opinion_1_init as a covariate in the regression:
run_test <- function(df) {
  fit <- lm(abs_diff_opinion_1 ~ network + abs_diff_opinion_1_init,
            data = df)
  coefs <- summary(fit)$coefficients
  # coefficient name for the group effect
  coef_name <- "networkantipartisan"
  p_val <- coefs[coef_name, "Pr(>|t|)"]
  p_val < 0.05  # alpha
}

With this updated model in place, we can now rerun our power analysis using regression instead of a \(t\)-test:
Show R Code (Power plot)
sample_size <- seq(2, 15, by = 1)

df <- tibble(sample_size = sample_size) |>
  mutate(
    power = map_dbl(sample_size, ~
      get_power_from_simulation(
        df = df_polarization,
        n_1 = .x,
        n_2 = .x,
        by_var = network,
        by_val1 = "neutral",
        by_val2 = "antipartisan",
        replications = 100
      )
    )
  )
library(plotly)

plot_ly(df,
  x = ~sample_size,
  y = ~power,
  type = 'scatter',
  name = 'Power by n',
  mode = 'lines+markers',
  hoverinfo = 'text',
  text = ~paste0('N = ', sample_size, '<br>Power = ', round(power, 2))
) %>%
  add_lines(x = ~sample_size,
            y = rep(0.8, nrow(df)),
            line = list(dash = 'dash', color = 'red'),
            name = '80% Power') %>%
  layout(
    title = 'Power vs. Sample Size',
    xaxis = list(title = 'Sample Size (N)'),
    yaxis = list(title = 'Power'),
    showlegend = TRUE
  )

As shown in the plot, adjusting for baseline polarization dramatically improves the power of our experiment. With this more sophisticated approach, a sample size of just \(n = 4\) per group is already sufficient to surpass 80% power. At \(n = 5\), power approaches nearly 100%.
This is a significant improvement over our initial \(t\)-test-only approach—and one that could save considerable time and resources. It underscores the value of leveraging all available data and using appropriate statistical models to maximize efficiency in experimental design.
🧩 Conclusion
In this chapter, we explored how power analysis serves as a critical tool for designing effective and efficient experiments. We began by revisiting the statistical foundations—clarifying the trade-off between power (\(1 - \beta\)) and specificity (\(1 - \alpha\))—and showed how low-powered studies can lead to ambiguous or misleading conclusions.
We then examined both analytical and simulation-based methods for calculating statistical power. While analytical formulas offer quick approximations, simulation-based approaches provide the flexibility needed for more complex or realistic experimental designs—especially in agent-based modeling contexts.
Using our own calibrated opinion dynamics model, we demonstrated how simulation can help estimate the number of “worlds” required to detect meaningful treatment effects. We also showed how controlling for pre-treatment measurements—such as initial polarization—can significantly increase statistical power, allowing us to draw stronger conclusions with fewer resources.
In practice, power analysis supports better planning, more transparent preregistration, and stronger scientific claims. Whether you’re preparing a grant proposal, designing a pilot, or running a full-scale experiment, power analysis ensures that your study is not only feasible but also meaningful.
Ultimately, power analysis is not just a technical requirement—it’s a safeguard for scientific integrity and a cornerstone of responsible research design.
Footnotes
Note that polarization, unlike an opinion, is not an individual property. Instead, it reflects a collective feature of the system. In agent-based modeling terms, polarization is a property of an entire world, not of any one agent.↩︎
In this interactive graph, \(\sigma\) does not only represent standard deviation in the classical sense, but can also serve as a proxy for sample size. Increasing the sample size reduces uncertainty around the estimated mean, effectively shrinking the observed standard deviation due to the law of large numbers. In that sense, decreasing \(\sigma\) here mimics the effect of increasing the number of observations.↩︎